RSDK-10618: Fix data race when a module crashes during reconfiguration #4975


Merged
jmatth merged 17 commits into viamrobotics:main from RSDK-10618 on May 16, 2025

Conversation

@jmatth jmatth commented May 7, 2025

This has a matching change in viamrobotics/goutils#432 that actually fixes the race condition. Unfortunately, the way modmanager is written meant that fixing the race caused a deadlock, which this PR addresses.

@jmatth jmatth requested a review from dgottlieb May 7, 2025 16:41
// There is a circular dependency that causes a deadlock if a module dies
// while being reconfigured. Break it here by giving up on the restart if we
// cannot lock the manager.
if locked := mgr.mu.TryLock(); !locked {
Member

If I'm reading this right, if a module crashes at the same time the mod manager has the lock for an unrelated reason (perhaps adding a resource to a different module), we'll give up on restarting the module? Is there some other means that would restart it?

And I suppose the problem (based on the comment) with just using the mod.inRecoveryLock here is that module reconfiguration does not take the in-recovery lock, but would need to (for the purpose of serializing with restarts -- which it definitely needs to do from a logic perspective)?

Member Author

we'll give up on restarting the module

Correct

Is there some other means that would restart it?

Not that I'm aware of

And I suppose the problem (based on the comment) from just using the mod.inRecoveryLock here is that module reconfiguration does not take the in recovery lock...

Correct, that lock was only ever taken by the restart callback. I actually removed it (and that weird atomic bool) since taking the modmanager lock made it unnecessary.
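For illustration, the pattern in the snippet above boils down to something like this; the package, types, and logging are simplified stand-ins for sketching purposes, not the actual modmanager code:

package sketch

import (
	"log"
	"sync"
)

// Illustrative stand-ins for the real modmanager types.
type module struct{ name string }

type Manager struct {
	mu sync.Mutex
}

// attemptRestart gives up instead of deadlocking: if another goroutine
// (e.g. Reconfigure) already holds the manager lock, skip the restart.
func (mgr *Manager) attemptRestart(mod *module) bool {
	if locked := mgr.mu.TryLock(); !locked {
		log.Printf("module %s exited while the manager was busy; skipping restart", mod.name)
		return false
	}
	defer mgr.mu.Unlock()
	// ... restart the module process here ...
	return true
}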

Member Author

jmatth commented May 15, 2025

Ok, this should be good to go now. Sorry for the extra noise from moving module and its methods into their own file. The functional changes are all in modmanager:

  • When removing a module, we now always set pendingRemoval, even if it doesn't have any resources
  • In the unexpected exit handler, we
    • Take the manager lock to avoid racing with any Reconfigure calls or similar
      • This also removed the need for separate restart locks / atomics on the module itself, so I removed them
    • Check if the module is either pending removal or the process is already running, and abandon the restart attempt if so (see the sketch below)
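Roughly, the guard described in the last bullet looks like the sketch below; the pendingRemoval field and the Status() check follow this thread, but the surrounding package and types are simplified stand-ins rather than the code merged here:

package sketch

import "sync"

// Simplified stand-ins; the real definitions live in RDK's modmanager package.
type managedProcess interface {
	// Status returns nil while the underlying OS process is still running.
	Status() error
}

type module struct {
	pendingRemoval bool
	process        managedProcess
}

type Manager struct {
	mu sync.Mutex
}

// handleUnexpectedExit sketches the flow described above: take the manager
// lock, re-check assumptions, and only then restart the module process.
func (mgr *Manager) handleUnexpectedExit(mod *module) {
	mgr.mu.Lock()
	defer mgr.mu.Unlock()

	if mod.pendingRemoval {
		// The module was removed while we waited on the lock; don't revive it.
		return
	}
	if err := mod.process.Status(); err == nil {
		// A replacement process is already running (Reconfigure won the race).
		return
	}
	// ... restart the module process with its existing configuration ...
}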

@jmatth jmatth requested a review from dgottlieb May 15, 2025 14:34
// closed.
// Always mark pendingRemoval even if there is no more work to do because the
// restart handler checks it to avoid restarting a removed module.
mod.pendingRemoval = true
Member

Not your code, just scrutinizing this variable now that I know it exists. This variable is also used to "hide" modules that are in shutdown, but otherwise still up. I can't comment off-diff (obligatory: GitHub PR is not a professional CR tool). But in modManager.getModule, we have this code:

func (mgr *Manager) getModule(conf resource.Config) (foundMod *module, exists bool) {
	mgr.modules.Range(func(_ string, mod *module) bool {
		var api resource.RPCAPI
		var ok bool
		for a := range mod.handles {
			if a.API == conf.API {
				api = a
				ok = true
				break
			}
		}
		if !ok {
			return true
		}
		for _, model := range mod.handles[api] {
			if conf.Model == model && !mod.pendingRemoval {
				foundMod = mod
				exists = true
				return false
			}
		}
		return true
	})

	return
}

Does that set us up for a race between the "delayed" removal of module foo V0 and adding a new component that depends on module foo V1?

Most of the mod manager operations are under a global lock (at least the ones related to module process lifetimes). So I suppose we're good. But just asking out loud now that I've been introduced to this nuanced module state.

Member Author

That's a good point. I don't think this set of changes creates any new races, since it's just setting the pendingRemoval field on modules that are being removed from the manager anyway. That said, it looks like a lot of codepaths to getModule don't take the lock (I guess because they're just reads?), so there may be more preexisting race conditions there.
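Purely for illustration (and not something this PR changes), a call site that wanted to serialize with removals could wrap the lookup in a hypothetical helper like this, reusing getModule and resource.Config from the snippet above:

// Hypothetical (not part of this PR): a call site that serializes the lookup
// with concurrent removals by holding the manager lock around getModule.
func (mgr *Manager) findModuleForConfig(conf resource.Config) (*module, bool) {
	mgr.mu.Lock()
	defer mgr.mu.Unlock()
	return mgr.getModule(conf)
}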

Member Author

obligatory: GitHub PR is not a professional CR tool

New 20% project: set up Viam Gerrit instance?

return
}

// Something else already started a new process while we were waiting on the
Member

I don't understand this. I know you were describing how the code likes to re-use the same module object. But I think we'll want something more satisfying than "something else did something we only know how to observe via a Status call".

What's the sequence of public API calls + module crashes that leads to this?

Member Author
@jmatth jmatth May 15, 2025

There are two cases we need to account for: the module process crashes while being removed, and the module process crashes while being reconfigured. The case this check addresses is the crash during reconfigure. The modmanager will kill and restart the module process, so by the time the exit handler has the lock and can proceed, a new process may already be running, and we need to avoid starting a second one. Here's my best attempt at showing the goroutines interleaving in markdown form:

| modmanager goroutine | module.process.manage goroutine |
| --- | --- |
| manager.Reconfigure() | <call to cmd.Wait() returns w/ err> |
| <manager.mu locked> | <module.process.mu locked> |
| manager.closeModule() | |
| module.stopProcess() | |
| module.process.Stop() | |
| <block on module.process.mu> | |
| | <reaches OUE handler, module.process.mu unlocked> |
| <module.process.mu unlocked, lock it and complete call to Stop> | manager.newOnUnexpectedExitHandler |
| manager.startModule() | <block on manager lock> |
| <various further calls assign a new running managedProcess to module.process> | |
| <manager.mu unlocked> | |
| | <unblocked, lock manager.mu> |
| | <module.process.Status() == nil, abort restart> |

The first fix I tried was passing in the managedProcess pointer to the exit handler along with the exit code and then checking whether it was the same as the pointer stored in module.process (roughly the comparison sketched below). If the pointers differed, then we lost the race and a new managedProcess instance had been set up on the module, so we should abort the restart. That felt cleaner than checking the current process status, but it required changing a public API in pexec, so I replaced it with the current implementation. I could probably find/create some other unique value that changes each time the module restarts to accomplish the same thing if we don't want to check the process status. Top-of-mind are:

  • Create a new module struct each time the module is restarted and compare the module pointers.
  • Use ManagedProcess.ID() to check if the process instance has changed. The docstring on ManagedProcess.ID() says it "returns the unique ID of the process", but currently we just set it to the module name, which in this context is not unique. I could add some random or incrementing value to that string to make it unique, but I'm not sure if anything else relies on that ID to map back to module names.
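For reference, the pointer-comparison alternative would boil down to something like the hypothetical sketch below; the helper name, package, and simplified types are made up for illustration, and the real change would also require plumbing the crashed process through pexec's exit handler:

package sketch

import "sync"

// Simplified stand-ins for illustration only.
type managedProcess struct{}

type module struct {
	process *managedProcess
}

type Manager struct {
	mu sync.Mutex
}

// restartIsStale reports whether the process that crashed has already been
// replaced on the module, in which case the exit handler should abort.
func (mgr *Manager) restartIsStale(mod *module, crashed *managedProcess) bool {
	mgr.mu.Lock()
	defer mgr.mu.Unlock()
	// If the pointers differ, Reconfigure installed a new process while the
	// exit handler was waiting, so the restart attempt lost the race.
	return mod.process != crashed
}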

Member

This all makes sense. Thanks for the detailed writeup. I'm going to leave some remarks re: documentation. But in hindsight, what you have is pretty good already.

Member

I also agree with the assessment re: checking a managed process pointer (or something more logical). That would be preferable, but not preferable enough to justify breaking the API further.

// Log error immediately, as this is unexpected behavior.
mod.logger.Errorw(
"Module has unexpectedly exited.", "module", mod.cfg.Name, "exit_code", exitCode,
)

// Take the lock to avoid racing with calls like Reconfigure that may make
Member

Suggested documentation:

Note that a module process can be restarted while preserving the same `module` object. Consider the case where a module is being restarted due to a configuration change concurrently with the existing module process crashing. There are two acceptable interleavings:
1. The `onUnexpectedExitHandler` restarts the module process with the old configuration, and `Reconfigure` then shuts down and restarts the (freshly launched) module process with one using the updated configuration.
2. `Reconfigure` executes and starts the module process[1] with the updated config. The `onUnexpectedExitHandler` will still run, but becomes a no-op.

For the second scenario, we check our assumptions after acquiring the modmanager mutex. If the module process is already running, there is nothing for us to do.

[1] I'm seeing that reconfigure has this code for stopping the existing module process:

	if err := mod.stopProcess(); err != nil {
		return errors.WithMessage(err, "error while stopping module "+mod.cfg.Name)
	}

Have we confirmed that, in our case of racing with a crash, stopProcess there will not return an error?

Member Author

In both outcomes of the race (either Stop or manage getting the process lock first), the call to managedProcess.Stop() will return an error reporting that the process it tried to kill did not exist. We whitelist that error and stop propagating it in module.stopProcess, so the module stop code will proceed normally.
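A rough sketch of that whitelisting, using an assumed sentinel error and stand-in types in place of whatever pexec actually returns and the real module definition:

package sketch

import "errors"

// errAlreadyExited stands in for whatever error pexec returns when Stop is
// called on a process that has already died; the real value may differ.
var errAlreadyExited = errors.New("process already exited")

type managedProcess interface{ Stop() error }

type module struct {
	process managedProcess
}

// stopProcess sketches the whitelisting described above: an "already dead"
// error is swallowed so module removal and reconfiguration proceed normally.
func (mod *module) stopProcess() error {
	if err := mod.process.Stop(); err != nil && !errors.Is(err, errAlreadyExited) {
		return err
	}
	return nil
}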

Member
@dgottlieb dgottlieb left a comment

Optimistically approving. Left a comment that just adds more words to the race your documentation alludes to. Feel free to correct any inaccuracies or rewrite to your tastes.

@jmatth jmatth merged commit 915c3bc into viamrobotics:main May 16, 2025
19 checks passed
@jmatth jmatth deleted the RSDK-10618 branch May 16, 2025 17:30